Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)


…e test was used for the generalisation test. Figure 7.17 (b) shows the ROC curves as well as the AUC values of three models. The AUC values are almost one, indicating near-perfect discrimination power between SARS-CoV and SARS-CoV-2 genomes based on this alignment-free multiple sequence comparison.
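To make the link between an AUC close to one and near-perfect discrimination concrete, the following is a minimal sketch in base R that computes the AUC from the rank (Mann-Whitney) statistic. The helper `auc_rank()` and the scores and labels are hypothetical illustrations, not the models or data behind Figure 7.17.

```r
# AUC via the Mann-Whitney rank statistic (base R only).
# 'scores' and 'labels' below are made-up values for illustration.
auc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks are used for tied scores
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Every positive scores higher than every negative: perfect separation.
scores_perfect <- c(0.05, 0.10, 0.20, 0.30, 0.70, 0.80, 0.90, 0.95)

# One negative and one positive swap places: separation is imperfect.
scores_overlap <- c(0.05, 0.10, 0.60, 0.30, 0.20, 0.80, 0.90, 0.95)

auc_perfect <- auc_rank(scores_perfect, labels)
auc_overlap <- auc_rank(scores_overlap, labels)
print(c(perfect = auc_perfect, overlap = auc_overlap))
```

The AUC is the proportion of positive-negative pairs in which the positive receives the higher score, so it reaches one only when the two classes do not overlap at all.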

Whole genome pattern discovery for SARS-CoV-2

The whole genome pattern discovery for SARS-CoV-2 is shown in this section. Both sequences and metadata were downloaded from the Global Initiative on Sharing All Influenza Data (GISAID) [Shu and McCauley, 2017] on the …ry 2021. Up to that date, there were 315,253 sequences in total from over 200 countries and regions.

Four countries with the highest infection numbers were selected for this chapter. They were USA, India, Russia and Brazil. There were 57,836, 4,325, 1,572 and 1,811 sequences for USA, India, Russia and Brazil, respectively. After removing duplicated sequences, there were 51,383, 4,321, 1,502 and 1,771 sequences left for USA, India, Russia and Brazil, respectively. There were finally 58,897 sequences from these four countries. Each sequence was coded using the 3-mer approach. Each sequence was therefore represented by a numeric vector over the 64 possible 3-mers (words), i.e., $\mathbf{x} \in \mathbb{R}^{64}$.
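The preprocessing described above (duplicate removal followed by 3-mer coding) can be sketched in base R as follows. This is an illustration under stated assumptions rather than the code used for the chapter: the toy sequences and the helper `kmer3()` are hypothetical, 3-mers are counted with overlap, words containing ambiguous bases such as N are discarded, and counts are converted to relative frequencies.

```r
# Hypothetical toy sequences; the real data are GISAID genomes.
seqs <- c(s1 = "ATGCGATACGCTTGA",
          s2 = "ATGCGATACGCTTGA",   # exact duplicate of s1
          s3 = "TTGACCGATNGGCAT")   # contains an ambiguous base (N)

# Step 1: remove duplicated sequences (the first copy is kept).
seqs <- seqs[!duplicated(seqs)]

# Step 2: enumerate all 4^3 = 64 possible 3-mers (words).
bases <- c("A", "C", "G", "T")
grid  <- expand.grid(bases, bases, bases, stringsAsFactors = FALSE)
words <- sort(do.call(paste0, grid))

# Step 3: count overlapping 3-mers and convert to relative frequencies.
# Assumption: words outside the 64 canonical 3-mers are dropped.
kmer3 <- function(s, words) {
  s      <- toupper(s)
  starts <- seq_len(nchar(s) - 2)
  kmers  <- substring(s, starts, starts + 2)
  counts <- table(factor(kmers, levels = words))
  as.numeric(counts) / sum(counts)
}

# One row per sequence, one column per word: each row is a vector in R^64.
X <- t(vapply(seqs, kmer3, numeric(length(words)), words = words))
colnames(X) <- words
print(dim(X))
```

Fixing the word order through `words` keeps the columns of `X` aligned across sequences, which any downstream model needs.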

Different types of models were constructed for this whole genome pattern discovery problem. First, unsupervised machine learning models were used to visualise how the genomic patterns (the 3-mer word frequency library) of the viral sequences were distributed in these four countries. This analysis examines whether the set of 3-mer vectors of all sequences ($\mathbf{X}$) can be efficiently and accurately divided into four subsets, namely $\Omega_{\mathrm{USA}}$, $\Omega_{\mathrm{India}}$, $\Omega_{\mathrm{Brazil}}$ and $\Omega_{\mathrm{Russia}}$, using an unsupervised machine learning model, $f(\mathbf{X})$.

$$f(\mathbf{X}) \Longrightarrow \bigcup \{\Omega_{\mathrm{USA}}, \Omega_{\mathrm{India}}, \Omega_{\mathrm{Brazil}}, \Omega_{\mathrm{Russia}}\}$$
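One way to examine such a partition is sketched below. The text at this point does not say which unsupervised model plays the role of f(X), so the sketch assumes principal component analysis followed by k-means with four clusters. The data are simulated stand-ins for the 3-mer matrix, not GISAID sequences.

```r
set.seed(1)
countries <- c("USA", "India", "Brazil", "Russia")
n_per     <- 25
label     <- rep(countries, each = n_per)

# Simulated stand-in for the 3-mer matrix X (rows = sequences, 64 columns).
# Each country gets its own small random offset. This is NOT real data.
offsets <- matrix(rnorm(4 * 64, sd = 0.02), nrow = 4,
                  dimnames = list(countries, NULL))
X <- offsets[label, ] + matrix(rnorm(length(label) * 64, sd = 0.01), ncol = 64)

# f(X): assumed here to be PCA followed by k-means with four clusters.
pc <- prcomp(X, center = TRUE, scale. = FALSE)
km <- kmeans(pc$x[, 1:2], centers = 4, nstart = 20)

# Cross-tabulate clusters against countries. A clean partition would place
# each country's sequences in a single cluster of its own.
tab <- table(country = label, cluster = km$cluster)
print(tab)

# Optional view of the first two principal components, coloured by country:
# plot(pc$x[, 1:2], col = as.integer(factor(label)), pch = 19)
```

If the rows of the cross-tabulation spread over several clusters, the country subsets overlap in 3-mer space and an unsupervised model cannot separate them cleanly.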